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Abstract 

The kernel-independent fast multipole method (KIFMM) proposed in [l[ is of almost 
linear complexity. In the original KIFMM the time-consuming M2L translations are 
accelerated by FFT. However, when more equivalent points are used to achieve higher 
accuracy, the efficiency of the FFT approach tends to be lower because more auxiliary 
volume grid points have to be added. In this paper, all the translations of the KIFMM 
are accelerated by using the singular value decomposition (SVD) based on the low rank 
property of the translating matrices. The acceleration of M2L is realized by first trans- 
forming the associated translating matrices into more compact form, and then using 
low-rank approximations. By using the transform matrices for M2L, the orders of the 
translating matrices in upward and downward passes are also reduced. The improved 
KIFMM is then applied to accelerate BEM. The performance of the proposed algorithms 
are demonstrated by three examples. Numerical results show that, compared with the 
original KIFMM, the present method can reduce the CPU time for matrix-vector multi- 
plication by a factor of 4 and memory usage by a factor of 2. 

Keywords: boundary element method; kernel-independent fast multipole method; 
singular value decomposition; matrix compression 



1. Introduction 

The boundary clement method (BEM) has become a promising numerical method 
in computational science and engineering. Despite many unique advantages, like the 
dimension reduction, high accuracy and suitability for treating infinite domain problems, 
a major disadvantage of the BEM is its dense system matrix which solution cost is 
prohibitive in large-scale problems. During the past three decades, several acceleration 
methods have been proposed to circumvent this disadvantage. Representative examples 
arc the fast multipole method (FMM)0|, wavelet compression method^], H-matrixQ, 
adaptive cross approximation (ACA)[I|, pre-corrected FFT etc. Among them the 
FMM is no doubt the most outstanding one. 
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The conventional FMM is originally proposed to accelerate the iV-body simulations, 
which requires the analytical expansions of the kernel functions. This poses a severe 
limitation on its applications to many problems where the analytical expansions are 
hard to be obtained. Besides, the kernel-tailored expansion makes it difficult to develop 
a universal FMM code for real-world applications. To overcome this drawback, the 
kernel- independent FMM (KIFMM) has been proposed in the past decade 0, 0, 0] • A 
salient feature of the KIFMM is that the expansion of the kernel function is no longer 
required. Instead, only a user-defined function for kernel value evaluation is needed; the 
structure of the FMM acceleration algorithm is in common for many typical problems. 

In this paper, the KIFMM proposed by Ying et al [l[ is concerned. This method uses 
equivalent densities in lieu of the analytical expansions. It provides a unified framework 
for fast summations with the Laplace, Stokes, Navier and similar kernel functions. Due 
to its ease-of-use and high efficiency, it has attracted the attention of many researchers 
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The moment-to-local (M2L) translation is the most time-consuming part of the FMM 
[H, [H, H, H 0, El- In the KIFMM [l| the M2L translation is accelerated by the fast 
Fourier transform (FFT), leading to 0(p 3 log p) computational complexity, where p is 
the number of equivalent points along the cube side. However, one should note that 
the efficiency of the FFT approach tends to become lower when p increases. This is 
because the equivalent points lie only on the boundary of each box, while to use the FFT 
Cartesian grid points interior the box must be considered as well. In this paper, the M2L 
in KIFMM is compressed and accelerated using the singular value decomposition (SVD); 
see Section[3] This method is built on the fact that the M2L matrices are typically of very 
low numerical ranks. Our numerical experiments, including those in Section^ show that 
the proposed method is more efficient than the FFT approach. Another advantage of the 
SVD accelerating approach is that it is more flexible than the FFT approach, because 
the later requires the equivalent and check points to be equally spaced while this is not 
needed in the SVD approach. Moreover, as a byproduct, the orders of the translating 
matrices in the upward and downward passes can also be reduced by using the transform 
matrices of M2L, leading to further reduction of the CPU time and memory usage for 
the upward and downward passes. 

The original KIFMM in [l| is designed to accelerate the potential evaluation for par- 
ticle simulations. Recently, this original method was applied to solve boundar y in tegral 
equations (BIEs) in, e.g., blood flow, molecular electrostatic problems [3, \VA\M- It is 
noticed that the central idea of all those works is to translate the far-field interactions to 
a particle summation formulation so that the original KIFMM can be used in a straight- 
forward manner. For example, Ref. (l6j deals with large-scale blood flow problem. The 
velocity of each red blood cell is divided into two components, namely the velocity of a 
reference point and the relative velocity reflecting the self deformation of the cell. By 
doing this, the interactions between red blood cells can be formulated into "particle 
summations" corresponding to the reference points for all the blood cells, and thus the 
KIFMM can be used. In [l7| , the Nystrom method is used to discretize the BIE in order 
to obtain the particle summation form. 

As is well known, the BEM is an dominate numerical method for solving BIEs and 
has profound applications in engineer. In this paper, the KIFMM is used to accelerate 
the BEM. This work is nontrivial since the KIFMM can not be straightly used in BEM 
due to the presence of elements, let alone to maintain the accuracy and efficiency. For 
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example, the equivalent and check surfaces are crucial components of the KIFMM. In 
the original KIFMM these surfaces can be set as the surfaces of each cube. However, in 
BEM setting this choice would deteriorate the accuracy, because the boundary elements 
belonging to a cube can often extrude from the cube; see Section B~T1 for the details in 
choosing those surfaces. 

2. Basic idea of the KIFMM 

The KIFMM was proposed in [l[ to solve the potential problems for particles. Here 
its framework is briefly reviewed. 

2.1. Setup 

Assume that there are N source densities {qi} located at N points {y;}. Then the 
induced field potential {pi} at points {x^} is given by 

N N 

Pi =p(xi) = ^G(xi, yj)g(yj) = ^G^qj, i = 1,2, • • • ,N, (1) 

3=1 3=1 

where, G(x, y) is the kernel function, which can be of the single layer, double layer or 
other layers. The complexity is obviously 0(N 2 ) if the potentials are evaluated naively 
by a order N matrix-vector multiplication. By using the FMMs this complexity can be 
reduced to O(N). 

The central to all the FMMs lies in the low-rank approximation (LRA) of the subma- 
trices representing the far-field interactions. The efficient realization of the LRA relies 
on a spacial tree structure. To construct the tree, all the particles are first included into 
a root level cube. Then the cube is equally divided into 8 cubes, generating the cubes 
in the next level. This subdivision is continued until the particles contained in each leaf 
cube is no more than a predetermined number s. 

For each cube C, let JV C denote its near field which consists of cubes in the same 
level that share at least one vertex with C; the union of the other cubes is defined to be 
its far field J^ c '. Let B denote the parent cube of C, then the interaction field of C is 
defined as J :C = & \& B . Let y C ' u denote the upward equivalent surface corresponding 
to cube C, x c,u denote the upward check surface, y C ' d denote the downward equivalent 
surface and x C ' d denote the downward check surface To guarantee the existence of 
the equivalent densities and check potentials, these surfaces must satisfy the following 
conditions: 

1. y c,u and x c,u lie between C and J^" , x C ' u encloses y ' u ; 

2. y c ' d and x C ' d lie between C and '; y C ' d encloses x C ' d ; 

3. y C ' u encloses y B,u where B is C"s child; 

4. y c ' n is disjoint from y S ' d for all B in ,^ B ] 

5. y c,d lies inside y B ' d where B is C"s parent. 

The above conditions can be satisfied by choosing all the related surfaces be the 
boundaries of cubes that are concentric with the cube. For each cube C with side length 
2r, y C ' u and x c ' d can be chosen as the boundary of the cube with halfwidth (1 + d)r, 
x c ' u and y c,d as the boundary of the cube with halfwidth (3 — 2d)r, where < d < |. 
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Therefore, the distance between the equivalent surface and the check surfaces involved in 
each translation is no less than (2 — 3(i)r. This relation is used in the original KIFMM [l| , 
with d being of a small value. In this way the equivalent surface and the check surfaces 
are well-separated, and high accuracy can be obtained. However, when the KIFMM is 
applied to BEM, d has to be set larger, or a large part of the source densities on elements 
belonging to C may extrude from its upward equivalent surface y ,u , and the sources 
belonging to cube C can not be well represented by its equivalent densities. Thus the 
size of the elements should be considered in defining these surfaces for BEM. See Section 

ED 

2.2. Far field translations 

Generally, in a FMM, the potentials induced by the sources in the near field are 
computed directly by ((T|), which is named as source-to-target (S2T) translation. The po- 
tentials induced by the sources in the far field are efficiently evaluated by a series of trans- 
lations, named as source-to-multipole (S2M), multipole-to-multipole (M2M), multipole-to- 
local (M2L), local-to-local (L2L) and local-to-target (L2T) translations. The main feature 
of the KIFMM lies in that the above translations are performed using equivalent densi- 
ties and check potentials, while in the conventional FMM the translations are performed 
using the multipole expansions and local expansions. The algorithm for evaluating the 
potential contribution of far-field sources in KIFMM is as follows. 

1. S2M. The source densities q in a leaf cube B are translated into its upward equiv- 
alent densities q B ' u ; that is, 

q S ' u = Sq, (2) 

with S being the translating matrix [![. 

2. M2M. The upward equivalent densities q B ' u of a cube B are transformed to the 
upward equivalent densities q c,u of its parent C, 

q c < u = Mq S ' u , (3) 

with M being the translating matrix. 

3. M2L. The upward equivalent densities q c,u of cube C are translated to the down- 
ward check potentials p D,d of cube D G J* c in its interaction field 

p D > d = Kq c ' u , (4) 

where, the translating matrix K is computed as 

Kij = G(xi,yj), 

with Xi being the i-th downward check point of D and y 3 - being the j-th upward 
equivalent point of C. 

4. L2L. The downward check potentials p D,d of cube D are translated to the downward 
check potentials of its child cube E, 

p*- d = Lp^ d , (5) 



with L being the translating matrix. 
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5. L2T. The downward check potentials p E ' d of leaf cube E are translated to the 
potentials p on the target points in E, 

P = TP EA , (6) 
with T being the translating matrix. 

Combining equations ([2])-([6]), the potential p on the target points in a leaf cube induced 
by the source densities q in another leaf cube in its far field can be computed as 

p = TLKMSq. (7) 

The M2L translation Q is the most time-consuming step in the KIFMM. It is ac- 
celerated by FFT in the original KIFMM [l|. In its implementation auxiliary points 
must be added inside the upward equivalent surface and the downward check surface, 
although one only needs the upward equivalent points and the downward check points 
on the corresponding surfaces. This makes the FFT approach less efficient when the 
number of the equivalent points and check points are large because the auxiliary points 
will account for a large proportion. In the next section, a SVD approach is proposed to 
accelerate the M2L translations as well as other translations. 



3. SVD-based acceleration for translations 

In this section, a new SVD-based accelerating technique is proposed, which can com- 
press all the transform matrices in KIFMM, thus both the M2L translation and the 
upward and downward passes are greatly accelerated. 



3.1. Matrix dimension reduction for M2L 

In the acceleration for M2L, SVD is applied in two stages. In the first stage, the M2L 
translating matrices are compressed into more compact forms. 

Suppose that the kernel function is translational invariant. The union of unique 
translating matrices over all cubes in each level forms a set of 316 matrices. To compress 
these matrices, first collect them into a fat matrix in which they are aligned in a single 
row and a thin matrix in which they are aligned in a single column 



K fat = K« K< 2 ' 



K 



thin 



K 



(i). 



K 



(2). 



K (316) 



K (316) 



where K' 1 ' is the i-th translating matrix. Perform SVD for these two matrices 
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AR 



(8a) 



(8b) 



(9a) 



(9b) 



Notice that in our algorithm, the entities of M2L matrices KW are the evaluations of 
single-layer kernel function. In most cases, they are symmetric, ic., = K", so (|§a|) 

and (j9b| are just transposes of each other, and the SVD has to be performed only once. 
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Consider the translating matrix for one translation 



U T K W R = SV (l)T R = U T Q W A. (10) 

Obviously, U T K^R decays both along the rows and columns as quickly as the singular 
values in X and A, thus it can be approximated by its submatrix U T KWR, therefore 

K = U(U T KR)R T ps U(U T KR)R T = UKR, (11) 

where, U and R are the the tailored matrices consisted by columns corresponding with 
dominant singular values that are not less than £i|jKf at ||2 = £iSo,o — EiAo.o, and K is 
the compressed translating matrix. Substituting (|TTj) into ycilds 

p = TLUKR T MSq. (12) 

Similar compression scheme was also used in Q . 

The compression (fTTj) is performed for M2L translating matrices at all levels. Let L 
denote the number of levels, then the computational complexity is O(L) ~ O(logiV). 
It can be reduced for the cases of homogeneous kernels. Assume that kernel function 
G(x, y) is homogeneous of degree of to, that is, G(ax, ay) = a m G(x, y) for any nonzero 
real a. Let (i — 1,2,..., 316) be the compressed translating matrices constructed 
from the interacting cubes that are scaled to have unit halfwidth. Then, the compressed 
translating matrices on the Z-th level can be computed efficiently by scaling 

K^gfKW (13) 

where, r is the halfwidth of the root cube in the octree. Therefore only Kq ^ has to 
be computed in the compressing procedure, and the computational complexity can be 
reduced into 0(1). 

The threshold £\ affects the balance between the computational cost and the accuracy 
of the algorithm. The induced error in each M2L translation is of order e-y, and the total 
error is approximately Le\ [l[. In order to maintain the error decreasing rate of BEM 
with piecewise constant element, Le\ should decrease by a factor of 2 with each mesh 
refinement 



In this paper, E\ is chosen by 



where, C\ is a constant coefficient. 



e 1 = C 1 ^-, (14) 



3.2. Further acceleration for M2L 

After the dimension reduction, it is found that most of the compressed M2L matrices 
K are still of low numerical ranks. For example, figure [1] illustrates the rank distribution 
of the interaction field of a cube C used in numerical example 1 5 . 1 1 wit h N = 2097152, p = 
8,Ci = 0.1, C2 = 100. The dimension of the original translating matrix is 296. After 
compression using U and R the dimension is reduced to 84. However, the figure clearly 
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shows that the actual numerical ranks of the matrices are still much lower than 84. This 
fact indicates that the computational cost of M2L can be further reduced by using the 
low rank decomposition of matrices K. Here the low rank decomposition is computed by 
SVD, so that optimal rank can be obtained. Since the number of the translating matrices 
is 0(1), this computational overhead is small. 
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Figure 1: Numerical rank distribution of M2L matrices Kg4 X g4 in numerical example 15.11 with TV = 
2097152, p = 8,Ci = 0.1, C 2 = 100. 

Consider the low rank approximations of the scaled matrices Ko for translational 

- u\ 

invariant and homogeneous kernels. Compute the SVD for each M2L matrix K , 
Truncate the singular values smaller than £2 1| Ko,f a t || 2 , and discard the corresponding 

(i) (i) 

columns in U and Q . Then the M2L translation can be approximated by 

P^= E (5)" l U«(S«Q«)qf- 

Cey ° (15) 

= e erw^ 

A (i) (i) 

where, S is the submatrix of S containing the dominant singular values that are no 
smaller than £2 1 1 Ko,f a t || 2 ; U ^ and Qq ar e the matrices consisted by the corresponding 
columns of TJq^ and , respectively; and Vq = S^' Qq . 

The error of approximation (TT5|) is determined by £2- Denote = Uq^Vq . From 
the truncating scheme, there exists 

||K«-K«|| 2 < £2 ||K , fat || 2 . 
For arbitrary m x n matrix A, one has |Aj| max < ||A||2 < \pnvn \ \ A|| max . Thus, 

||K« - K^'H^ < e 3 ||K fet || 2 . 
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Let Ko.fat an d Ko.fat denote the fat matrices for Ko and Ko, respectively, which are 
constructed similarly as Kf at in (|8a[) . It is easy to know that 



||Ko,fat — Ko,fat||max < ^2 || K 0i f a t || 2 • 

Since the dimension of Ko,f a t is p x 316p, where p is the dimension of Ko, and 

||K 0i f a t — Ko.fat || max > , ||Ko,f a t — K .fat || 2 , 

\/316p 2 

one has 

||K ,f at - K ,f at ||2 < e 2 >/316p 2 ||K O.fat |2- 

Therefore, the error introduced by the low rank approximation is ensured to be of same 
order as e± by letting 

ei £i 
e 2 ~ , ~ — ■ 

y/3l6p 2 P 

In our scheme, it is defined by 



S2 = C 2 £ 4, (16) 
P 



where C 2 is a constant coefficient. 



3.3. Acceleration for upward and downward passes 

The transformation matrices U and R can also be used to compress the matrices for 
upward and downward passes. 

Consider matrix product R T RMV. Note R T RM is the projection of matrix M to 
the space R spanned by the columns of R. For any matrix V, it can be divided into two 
parts 

V = V!+V 2 , 

where, Vi = R T RV is the projection of V to the space R, and V2 is the residual. 
Obviously R T RMV 2 = 0, thus 

R T RMV = R T RMVi = R T RMR T RV. (17) 

Similarly, for matrix product LU T UV, divide L into two parts 

L = Li + L 2 , 

where, Li = UUL and L 2 is the residual. Then L 2 U T UV = which leads to 

LU T UV = LiU T UV = U T ULU T UV. (18) 
Now consider Eq. (TT2j) 

p = TLUU T KRR T MSq. (19) 

According to relations (TTT|) and (TT51) . UU T can be inserted between T and L and RR T 
can be inserted between M and S. Thus 



p =TUU T LUU T KRR T MRR T Sq 

= (TU) (U T LU) (U T KR) (R T MR) (R T S)q (20) 
=TLKMSq, 



where, 

S = R T S 

is the new translating matrix for S2M; 

M = R T MR 
is the new translating matrix for M2M; 

L = U T LU 
is the new translating matrix for L2L; and 

T = TU 

is the new translating matrix for L2T. Since both the transformation matrices U and R 
are thin matrices, the new translating matrices S, M, L and T are smaller than their 
original matrices, and thus the computational cost of the upward and downward passes 
can be reduced. 



4. KIFMM for BEM 

In this section, the KIFMM is applied to accelerate the BEM. One should note that 
in BEM the sources distribute continuously on the boundary elements instead of on a 
group of discrete points in the original KIFMM [lj . Therefore, in the BEM the potential 
evaluation in S2T and S2M operations must be performed by integration rather than 
summation. More importantly since the continuous sources are represented by nodal 
basis functions, the sources has to be grouped based on the supports of nodal functions, 
which would make the definition of equivalent and check surfaces different from original 
KIFMM. 

For clarification in explanation, the single-layer BIE for Laplace problem is considered. 
Let O be a bounded domain with boundary T. Given a known potential /(x) on the 
boundary T, the source density distribution q(x) satisfies 

^G(x,y)g(y)dy = /(x), x g T, (21) 

where, G(x,y) = l/(47r|x — y|) is the fundamental solution of the Laplace equation. By 
partitioning the boundary T into triangular elements and using the piecewise constant 
basis functions with the nodal points on element centroids, the collocation BEM leads 
to a linear system 

Aq = b 

with q consisting of the source densities on each triangles, b consisting of the known 
potentials on each collocation points, and 

= [ G(x l ,y) Xj ( y )dy. (22) 

where, Xj is the i-th collocation point, Aj is the j-th triangle, and Xj(y) is the basis 
function on Aj. When the system solved by iterative methods, the main computational 
cost is spent on the matrix-vector multiplication (MVM) of which the complexity is 
0(N 2 ). This complexity can be reduced to 0(N) by the KIFMM. 
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4-1- The equivalent and check surfaces 

As mentioned above, the definition of the equivalent and check surfaces for BEM has 
to be different with the original KIFMM for particle simulations, because the equivalent 
surface must enclose all the sources according to the potential theory 

In constructing the octree for BEM, the centroids of elements are used as reference 
points, and the subdivision process is similar to that in Section |2~T1 Figure [5] illustrates 
a leaf cube C in the octree. The union of all the elements whose centroids lying in C 
is denoted by r(C). One should note that T(C) may extrude from C. As a result, for 
BEM the first two restrictions of surface definition in Section 12.11 have to be modified as 
follows: 

1. y c,u and x C ' u lie between T(C) and ,^ c ; x c < u encloses y c,u ; 

2. y CA and x c < d lie between C and T(&°), with T(& c ) being the union of all 
elements that belongs to ^ c \ y C ' d encloses x C ' d . 



Figure 2: The elements and the upward equivalent surface related to a leaf cube. 

The equivalent and check surfaces for BEM are defined similarly with Section 12.11 
However, in order to satisfy the above restrictions, the relative distance between a cube 
and its upward equivalent surface d has to be chosen large enough so that for each leaf 
cube, the triangles "belonging to" it are enclosed in its upward equivalent surface. For a 
quasi-uniform clement partition, assume that the size of the element is h and each leaf 
cube contains at most s elements, then the halfwidth of the leaf cubes in the finest level 
is proportional with sfsh. The distance between the outmost vertex and the cube surface 
is no larger than h, thus d is of order 



\fsh 

So, in this paper d is evaluated as 

d=C d ^=, (23) 

where, Cd is user-defined constant. Our numerical experience indicates that Cd = 0.5 is 
proper for most problems. 
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4-2. S2T and S2M translations 

The equivalent points and check points are sampled on the equivalent and check 
surfaces, respectively. Then the potentials on the collocation points can be evaluated 
efficiently by KIFMM, in which the contribution of the near-field sources are evaluated 
by S2T, and the contribution of the far-field sources are evaluated efficiently using the 
equivalent densities and check potentials. 

The potentials produced by near-field sources are evaluated by S2T translation. In the 
original KIFMM, it is performed by direct evaluation ([1]), since the sources distribute 
on discrete points. However, in BEM the sources distribute on elements, thus these 
potentials should be evaluated by integration 



= / G(x l5 y)g(y)dy 

= [ G(x 4 ,y) X ,(y)dy, A, e T(^ c ), 



where p(xj) is the check potential on the i-th collocation point. 

The potentials produced by far-field sources are evaluated by a series of translations, 
namely S2M, M2M, M2L, L2L and L2T. Among these translations, S2M need to evaluate 
the upward check potentials produced by the sources belonging to the leaf cube. Similar 
to S2T translation, this must be implemented by integration as well 



P C ' u (x)= / G(x,y)g(y)dy 
Jr(C) 

= 5>/ G(x,y) Xj (y)dy, A, e T(C) 



where, p C ' u (x) is the upward check potential for leaf cube C. 

In the above sections, the accelerating algorithm for single layer type boundary in- 
tegral is introduced. With slight modifications it can be used to accelerate double layer 
boundary integral. That is, only the integral kernel function in S2T translation and 
the first step in S2M translation should be replaced into double layer kernel. Therefore, 
upward equivalent densities should be solved by 

/ G(x,y)^ u (y)dy= / \ } g(y)dy, for any x G x c,u . (26) 

JyC" Jt{C) OXY{y) 

Discretized with upward check points and upward equivalent points, a linear system can 
be achieved, and the upward equivalent densities can be solved. The other steps of the 
algorithm remains the same with that dealing with single layer boundary integral. 

4-3. The complete algorithm 

In KIFMM for BEM, the discretized sources are grouped into cubes in a octree, then 
the potentials on collocation points are divided into two parts, namely the contribution 
of the near-field sources and the contribution of the far-field sources. The former is 
evaluated by S2T, while the later is evaluated by a series of translations. The complete 
algorithm for BEM is implemented by the following steps: 
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Algorithm SVD accelerated KIFMM for BEM 



Step 1 Setup 

1 Construct the octree by subdividing the leaf cube recursively. 

2 For each cube C, find the cubes in its near field jV and interaction field J' c . 

3 Define the equivalent and check surfaces by the method described in Section I4TT1 

4 Compute and compress the translating matrices by the compressing approach in 

Section [3] 
Step 2 Upward pass 

5 for each leaf cube C in postorder traversal of the tree do 

6 Compute the upward equivalent densities (S2M). 

7 end for 

8 for each non-leaf cube C in postorder traversal of the tree do 

9 Compute the upward equivalent densities (M2M). 

10 end for 

Step 3 Downward pass 

11 for each non-leaf cube C in preorder traversal of the tree do 

12 Add to the downward check potentials produced by the sources in its inter- 

action list (M2L) 

13 Add to the downward check potentials of its child cubes (L2L) 

14 end for 

15 for each leaf cube C in preorder traversal of the tree do 

16 Evaluate the potentials on the collocation points (L2T) 

17 end for 

Step 4 Near-field interaction 

18 for each leaf cube C in preorder traversal of the tree do 

19 Add to the potential the contribution of near field source densities (S2T), 

which should be evaluated by Eq. (|2"4")l 

20 end for 



In our method, the definition of the equivalent and check surfaces are different with 
the original KIFMM. However, this does not affect the computational cost. The total 
computational complexity of our new KIFMM for BEM remains O(N). 

5. Numerical Examples 

The performance of our SVD-based accelerating technique and the kernel-independent 
fast multipole BEM for Laplace BIEs is demonstrated by three numerical examples. The 
resulting linear systems are solved by GMRES solver. The algorithms are implemented 
based on the public kifmm3d code available from [l9[. All simulations are carried out on 
a computer with a Xeon 5440 (3.00 GHz) CPU and 28 GB RAM. 

5.1. Electrostatic problem 

In this subsection, the electric charge density on an ellipsoidal conductor is computed 
by solving Eq. (|2"Tj) . The ellipsoid can be described by (xi/2) 2 + x\ + (x 3 /3) 2 = 1. 
The analytic solution can be expressed analytically using ellipsoidal coordinates. The 
convergence tolerance for GMRES solver is set to be 10 -6 . The surface of the ellipsoid 
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Table 1: Errors obtained with different C'i and C2 



Relative error 



d=0.1 Ci=0.5 Ci=0.1 Cx = 0.1 Ci=0.1 
c 2 = c* 2 = c* 2 = 10 c 2 = 100 C* 2 = 500 

















P = 


4 
























512 





.069 


936 





.070 


235 


0.086 


715 





.071 


639 





.074 


210 





.074 


210 


2 


048 





.032 


898 





.033 


050 


0.033 


227 





.033 


080 





.039 


421 





.060 


588 


8 


192 





.014 


010 





.014 


047 


0.016 


236 





.014 


055 





.015 


502 





.062 


169 


32 


768 





.006 


697 





.006 


704 


0.007 


845 





.006 


702 





.007 


191 





.028 


167 


131 


072 





.003 


517 





.003 


518 


0.004 


236 





.003 


521 





.003 


720 





.009 


063 


524 


288 





.002 


960 





.002 


962 


0.003 


100 





.002 


966 





.003 


098 





.004 


419 


2 097 


152 





.004 


880 





.004 


880 


0.004 


887 





.004 


883 





.004 


896 





.005 


296 




512 





.069 


923 





.070 


P = 
111 


6 

0.086 


805 





.071 


593 





.074 


541 





.074 


541 


2 


048 





.032 


901 





.033 


038 


0.033 


270 





.033 


082 





.043 


255 





.062 


104 


8 


192 





.014 


001 





.014 


076 


0.016 


217 





.014 


081 





.016 


608 





.063 


492 


32 


768 





.006 


641 





.006 


679 


0.008 


849 





.006 


681 





.007 


300 





.028 


298 


131 


072 





.003 


182 





.003 


219 


0.003 


983 





.003 


220 





.003 


351 





.007 


635 


524 


288 





.001 


579 





.001 


587 


0.002 


905 





.001 


587 





.001 


607 





.002 


321 


2 097 


152 





.000 


790 





.000 


791 


0.001 


227 





.000 


791 





.000 


793 





.000 


932 




512 





.069 


923 





.070 


P = 
122 


8 

0.086 


831 





.071 


622 





.074 


681 





.074 


681 


2 


048 





.032 


901 





.033 


035 


0.033 


299 





.033 


075 





.046 


320 





.062 


675 


8 


192 





.014 


001 





.014 


065 


0.016 


208 





.014 


071 





.016 


880 





.064 


226 


32 


768 





.006 


641 





.006 


677 


0.008 


673 





.006 


670 





.007 


231 





.030 


980 


131 


072 





.003 


182 





.003 


216 


0.004 


141 





.003 


217 





.003 


344 





.009 


128 


524 


288 





.001 


579 





.001 


585 


0.002 


467 





.001 


585 





.001 


602 





.002 


632 


2 097 


152 





.000 


789 





.000 


790 


0.001 


014 





.000 


790 





.000 


793 





.000 


988 



is first discretized into N = 512 triangular elements, then the mesh is refined 6 times. 
The finest mesh has N = 2097152 elements. 

The accuracy and efficiency of the present KIFMM BEM arc mainly determined by 
parameters C\ in (|14p and C2 in (|16[) . The translating matrices are independent with 
the boundary since they are only determined by the position of the equivalent and check 
points which arc defined in the same manner as discussed in section |4~T1 The SVD accel- 
erating approach truncates small singular values of these translating matrices, therefore 
the induced error by SVD acceleration only depends on C\ and C 2 . Consequently, C\ and 
Ci should keep the same values for various boundary element analyses. From equation 
(|14p we know that with larger C\, more singular values are discarded, and the translat- 
ing matrices for M2L and upward and downward passes would be compressed into more 
compact form, thus the computing time could be reduced lower. While on the other hand 
the error would become larger. Similar conclusions could be made for C2. Consequently, 
the choices of C\ and C2 are determined by the tradeoff between the accuracy and the 
efficiency. 
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Table 2: CPU times in each iteration Ti t and the total memory usage with different Ci and C2 



N 


Tit (s) 


Memory usage 


(MB) 








= 0.1 




Ci - 


0.1 




c 2 


= 100 


c 2 = 


100 











p — 4 










512 


~ 


~ 


~ 


2.7 


1.8 


1.8 


2 


048 


0.02 


0.01 


0.01 


9.3 


7.1 


6.7 


8 


192 


0.10 


0.06 


0.05 


31.4 


27.8 


27.4 


32 


768 


0.37 


0.31 


0.24 


121.3 


116.9 


116.3 


131 


072 


1.57 


1.50 


1.10 


471.1 


466.4 


465.3 


524 


288 


6.30 


7.34 


5.25 


1 879.6 


1 892.8 


1 891.6 


2 097 


152 


18.25 


30.12 


24.32 


7 504.3 


7 598.9 


7 597.7 




512 


0.01 


~ 


p = 6 
~ 


6.1 


3.0 


3.0 


2 


048 


0.05 


0.01 


0.01 


19.8 


8.1 


7.8 


8 


192 


0.31 


0.06 


0.05 


55.8 


28.3 


27.9 


32 


768 


1.11 


0.30 


0.23 


186.4 


119.5 


118.9 


131 


072 


4.75 


1.78 


1.31 


736.5 


493.4 


492.1 


524 


288 


13.26 


8.56 


6.46 


2 917.5 


2 062.5 


2 060.8 


2 097 


152 


76.16 


45.51 


34.79 


11 669.0 


8 861.7 


8 859.8 




512 


0.01 


~ 


P = 8 
~ 


14.4 


7.7 


7.7 


2 


048 


0.17 


0.01 


0.01 


44.0 


12.8 


12.5 


8 


192 


0.70 


0.06 


0.04 


100.8 


33.0 


32.6 


32 


768 


2.98 


0.30 


0.17 


292.3 


124.2 


123.6 


131 


072 


12.50 


1.63 


1.21 


1 143.1 


489.5 


488.3 


524 


288 


49.09 


8.37 


4.57 


4 482.6 


2 067.1 


2 065.5 


2 097 


152 


198.03 


44.89 


34.73 


17 924.3 


8 866.4 


8 864.6 



First the influence of C\ on the accuracy of the algorithm is tested. Three cases with 
C\ being 0, 0.1 and 0.5 are computed. The results corresponding to C\ = arc computed 
using the original FFT-accelerating scheme in [l|. In all the three cases. C 2 is set to be 
0. 

The resulting errors are listed in the second to fourth columns of Table [TJ One can see 
that when C\ = 0.1 the errors are almost the same as that computed by FFT-accelerating 
scheme. However, when C± = 0.5 errors for p = 6, 8 are increased. This indicates that 
Ci = 0.1 is nearly optimal for retaining the accuracy. It is noticed that when p = 4 the 
error tends to be larger when the DOF is high. This is because the error of the algorithm 
depends on p and the depth L of the octree. To get high accuracy, p has to be increased 
to reduce the error in each translation; see [l[ for the details. 

The influence of C 2 is studied by setting C 2 = 10, 100, 500 while d = 0.1. In Table 
[T] it can be seen that for C 2 = 10 and C 2 = 100 the results keep almost the same errors; 
while for C 2 = 500 the errors increase. Although the errors raise with the increase of 
C 2 , the test case indicates that a choice of C 2 between 10 and 100 can maintains almost 
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the same accuracy. The CPU times T- lt in each iteration and the total memory usage of 
the two methods, FFT-accelerating approach and the SVD accelerating approach, are 
listed in Table [5J In Table [2J Jit can reduce considerably with larger C2 , but the memory 
cost only reduce slightly. The reason is that C2 only controls the accuracy and efficiency 
of the low-rank approximation for M2L matrices, as discussed in section 13.21 In this 
problem, since the kernel is translational invariant and homogeneous, the memory cost 
for M2L matrices is only of 0(1). Therefore, the memory reduction is negligible. 

From Table one can see that the CPU time in the SVD approach can be reduced 
significantly for large p, comparing with the FFT approach, since the CPU time for each 
iteration in the SVD approach is not sensitive to p. For example, the Tj t s of the SVD 
approach for the cases p = 6 are almost the same as that for p = 8. This is because the 
size of the compressed translating matrices are mainly determined by the compressing 
threshold e\ when p is large, and the numerical rank of M2L matrices are only determined 
by £2- Both £1 and £2 are independent with p. However, in the FFT approach more 
auxiliary points has to be added, which makes the FFT approach less efficient. Besides 
the CPU time, the memory usage can also be considerably reduced in the SVD approach, 
since the translating matrices used in S2M and L2T are compressed into more condensed 
form by the scheme in section f3. 31 

This example shows that the accuracy reduces with the increase of C\ and C2. When 
C\ = 0.1 and C2 = 10 the SVD accelerating approach is much more efficient than FFT- 
accelerating approach without significantly affecting the accuracy. C\ = 0.1 and C2 = 10 
will be used in the next numerical examples. 

5.2. A mixed boundary condition problem 

To demonstrate the performance of the SVD accelerating approach for more compli- 
cated geometry and boundary condition problems, Laplace equiation with mixed bound- 
ary conditions on a shaft model illustrated in Figure [3] is simulated. The analytical 
solution is set be to u = l/|xo — x|, with Xo being outside the computational domain. 
The potential u is given on the two end surfaces (red surfaces in Figure [3]) , and the 
flux q is given on the remainder (gray) surfaces. The controlling parameters are set as 
p = 6, C\ = 0.1, C2 = 10, and the converging tolerance for GMRES solver is set to be 
10~ 6 . 




Figure 3: A shaft model with mixed boundary condition. 
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The problem is solved by using the KIFMM BEM with the FFT approach and the 
SVD approach, respectively. The results are reported in Table [3] It can be seen that 
the SVD approach is more efficient and memory-saving than the FFT approach. The 
£ 2 -error of u decays linearly with 0(h) and the L 2 -error of q decays as 0{\fh). The time 
consuming in each iteration and the memory consuming increase almost linearly, which 
indicate that the computational complexity of the method is almost 0(N). 



Table 3: Performance of SVD accelerating approach and FFT-accclcrating approach 



N 


Error of u 


Error of q 


Ht (s) 


Memory (MB) 


FFT SVD 


FFT SVD 


FFT SVD 


FFT SVD 


18 048 
72 518 
288 768 
1 156 042 


0.021 521 0.020 222 
0.012 367 0.011 263 
0.004 648 0.004 505 
0.001 862 0.001 866 


0.062 718 0.066 147 
0.052 828 0.053 375 
0.003 173 0.003 580 
0.002 001 0.002 415 


0.73 0.20 
2.97 0.83 
7.15 4.52 
25.00 14.85 


115.1 77.6 
396.9 281.6 
1 532.3 1 105.4 
5 899.9 4 533.4 



5.3. Heat conduction problem 

To demonstrate the ability of the present KIFMM BEM for solving real-world prob- 
lems, a steady-state heat conduction analysis of a engine block is solved here; see figure 
5.31 The temperature field is governed by the Laplace equation. The conductivity of 
the engine block is A = 80W/(m ° C). The temperature of the inner surface of the 
oblique tube and the temperature of the bottom surface are set to be 75°C and 100°C, 
respectively. Convective condition with constant film coefficient h = 10W/(m 2 -° C) and 
constant bulk temperature To = 22°C are applied to the other surfaces. Simulations 
are performed using a series of meshes with number of elements ranging from 85 680 to 
nearly 5 million. The controlling parameters are set as p = 6,Ci = 0.1, Ci = 10. For 
comparison, this problem is also solved by finite element method (FEM) with 698317 
tctrahcdral elements, 1015653 nodes. The converging tolerance for GMRES solver is 
10" 4 . 

The CPU times and memory usage for different meshes are listed in Table [U where 
iVit and Tjt stand for the number of iterations and the CPU time for each iteration, 
respectively. Again one can see linear behavior of the CPU time and memory requirement. 
The computed temperature distribution using mesh with 325 774 elements is exhibited 
in Figure [4(b) | It can be seen that the temperature distribution obtained by the KIFMM 



BEM agrees very well with that by FEM in 4(a) 



It is noticed that, with the KIFMM BEM in this paper, the largest model with nearly 
5 million DOFs is successfully solved within 5 hours. 



6. Conclusion 

The FMM is one of the most successful fast algorithms for BEM acceleration. But 
it requires the analytical expansion of the kernel function, which makes it difficult to 
be applied to some complicated problems. Recently, various kernel- independent FMMs 
were developed to overcome this drawback. Among them the KIFMM proposed in [1] 
has high efficiency and accuracy, and thus has been extensively used [lj| EE Ell • The 
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Table 4: CPU times (s) and memory usage (MB) for engine-block heat conduction analysis 



N 


N it 


Itotal 


Tit 


Memory 


85 680 


90 


582.4 


3.8 


502.5 


325 774 


96 


2 111.1 


13.3 


1 735.5 


900 420 


100 


4 251.0 


27.5 


4 589.0 


1 370 880 


103 


5 946.3 


34.7 


7 374.8 


4 754 670 


97 


18 330.4 


108.2 


25 021.5 




(a) FEM result (b) KIFMM result 



Figure 4: Temperature distributions of the engine-block model computed by FEM and KIFMM BEM. 

KIFMM uses equivalent densities and check potentials instead of analytical expansions 
to construct the fast algorithm. The time consuming M2L translations arc accelerated 
by using the FFT. However, it is noticed that when more equivalent and check points 
are sampled to get higher accuracy, the efficiency of the FFT approach tends to be lower 
because more auxiliary volume grid points have to be added in order to do FFT. 

In this paper, the low rank property of the translating matrices in KIFMM is suffi- 
ciently exploited by SVD (called SVD approach in this paper) to accelerate all the transla- 
tions, including the most time-consuming M2L. The acceleration of the M2L translations 
is carried out in two stages. First the translating matrix is compressed into more compact 
form, and then it is approximated by low-rank decomposition. By using the compression 
matrices for M2L, the translating matrices in upward and downward passes can also be 
compressed into more compact form. Finally, the above improved KIFMM is applied 
to accelerate BEM, leading to a highly efficient KIFMM BEM for solving large-scale 
problems. 

The accuracy and efficiency of the SVD approach and the KIFMM BEM are demon- 
strated by three numerical examples. It is shown that, when compared with the FFT- 
accelerated KIFMM, the SVD approach can reduce the CPU time of one matrix-vector 
multiplication by a factor of 4, and can reduce the total memory requirement by a factor 
of 2. The presented KIFMM BEM is of O(N) complexity. By using this method Laplace 
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problem with nearly 5 million unknowns can be successfully solved within 5 hours on a 
Xeon-5440 2.83 GHz CPU and 28 GB RAM. 
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